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The purpose of this essay is to explore both what we know and what we need to learn 
about the information value of performance items when they are used in large scale 
assessments. Within the context of the National Assessment of Educational Progress 
(NAEP), there is substantial motivation for answering these questions. Over the past 
decade, in order to adequately portray the breadth and depth of important curriculum 
standards, NAEP designers have invested substantial time and energy in creating 
extended constructed-response items and enormous financial resources in scoring these 
items. While these items are popular with curriculum experts within various content 
areas (Linn, Glaser, & Bohrnstedt, 1997), it is not clear whether they possess the 
marginal utility required to justify their cost; that is, they may not provide new 
information above and beyond that which is provided by a more standard mix of 
multiple-choice and short items with known measurement characteristics and much more 
economical scoring protocols. Even worse, there is some reason to believe, based on 
research in non -NAEP settings, that extended constructed-response items may provide 
negative returns in terms of the overall goal of accurate measurement of performance on 
some broad domain such as reading, mathematics, or science (see Forsyth, Hambleton, 
Linn, Mislevy, and Yen, 1996). 

To raise this issue is not to question the inherent validity of these tasks within internal 
assessment systems (i.e., as they are embedded within particular curricula), but only 
whether they have a role in the “drop-in-from-the-sky” assessment approach of external 
assessments such as NAEP (Forsyth, et al, 1996). In order to prove worthy of the added 
design and scoring costs in the latter context, however, they must be shown to play a 
significant role in improving the accuracy and/or the credibility of these externally 
motivated large-scale assessments. 

A more specific version of this broad question is contained in the initial charge provided 
to us by the panel overseeing a set of validity studies to guide the future development of 
NAEP (AIR, 1995): 

This paper would review what is currently known about the psychometric 
properties of performance items in large scale assessment generally, and in 
NAEP in particular. The paper would then go on to propose modifications 
that appear to hold the greatest potential for improving the information 
value of such items and to lay out the design, time line, and budget for one 
or more empirical studies to explore (a subset of) these modifications. 
Possible candidates for study might include ways of obtaining multiple 
data points from single exercises, refining item directions and scoring 
rubrics to improve alignment with content framework constructs, 
improving the different characteristics of such items, etc. The costs of 
large scale administration and scoring will he taken into consideration in 
evaluating alternatives for further study. 

To dispatch our charge, we have traversed the “information value” terrain along as many 
paths as we could find — combing the measurement literature to determine the various 
ways in which scholars have conceptualized and operationalized the “information value” 
construct, reviewing research conducted within those approaches, consulting essays that 
emphasize the importance of strong conceptual grounding in content frameworks, 
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studying the broadest available construal of the construct of validity (e.g., Messick, 1989), 
and, finally, and most unsatisifyingly, attempting to determine, at a conceptual and 
philosophical level, what the assessment community means when they talk about the 
information provided by assessments. 

We have organized our essay into four sections, First, we consider the construct of 
information value in its broadest philosophical sense and then describe the classical ways 
of operationalizing it. Second, we review the available literature within each of these 
operational traditions (IRT, factor analysis, correlational studies). Third, we consider 
some alternative versions of information value, based more on cognitive, conceptual, and 
pragmatic considerations. Finally, we outline a series of studies that we believe ought to 
be supported by the National Center for Educational Statistics (NCES) in order to 
answer the question of how NAEP can be modified to increase its “information value.” 


The Idea of Information Value 

Information has “value” to the degree that it helps us make decisions or draw conclusions 
about issues or questions that matter to us. If we have only a single datum available to 
make a decision, then the question of value added (what marginal information do I get?) 
is moot because “what you see is what you get.” But as soon as an additional datum is 
available, the question of value added becomes important, and the common sense 
question is whether we will make a better (i.e., more valid or more accurate) decision 
with this added information available to us. 

When it comes to educational measurement, and in particular large scale educational 
measurement, the issues and questions that “matter to us” usually take the form of queries 
about cognitive processes or curriculum, such as, 

• How well does this population read? 

• What does this population know about mathematical problem solving? 

• How much does this population know about American history? 

A datum, or bit of information, within a psychometric framework could be a score on a 
test or a score on a test item. The question of interest, vis-a-vis information value, is 
whether an additional hit of information helps us better answer the question(s) of 
interest. Ultimately, however, the issue of how new information adds value must be 
addressed, and the answer is not always straightforward. New information can serve 
several functions in the decision-making processes of individuals or groups. 

1. First, new information could increase our confidence in the decision we have 
already made or are about to make. Psychometrically, this is tantamount to 
increasing the psychometric precision of our measurement. Within the par- 
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lance of Item Response Theory (IRT), this is typically what is indexed by the 
information function. By ascertaining the item information function for 
groups of items, we can make decisions about the ability level on the underly- 
ing trait at which particular sets of items provide maximal information (or, if 
you prefer, maximal capacity to discriminate among individuals with differ- 
ent levels of the ability in question). IRT analysis has occasionally been 
employed to compare the relative information contributions of multi- 
ple-choice and construe ted-response items. For example, Bridgernan (1992) 
examined the item characteristic curves (ICCs) for mathematics items (on 
the GRE quantitative scale) that appeared in both multiple-choice and con- 
structed-response (filling in spaces on a grid) format. He found that the mul- 
tiple-choice items were generally easier but the relative difficulty of paired 
items (one me item and one cr item measuring precisely the same content) 
varied unpredictably across ability levels. 

2. “New” information focuses on novelty rather than precision. In other words, 
the additional information leads to the elaboration of a new construct or 
dimension. Thus, instead of increasing the precision of our measurement of a 
particular construct, such as mathematical power, an additional item, subtest, 
or format could give us information about a related constmct, such as mathe- 
matical problem solving. Several studies have been conducted to determine 
the value added in this sense of new information. Factor analysis is the most 
common tool used to calibrate the contribution of hypothetically separate 
items or item sets (do items in different formats load on different factors?), 
although occasionally researchers use correlational or regression analysis to 
examine the relative distributions of common and unique variance among 
different sets of items. For example, Ackerman and Smith (1988) examined 
the relationships of several multiple-choice and direct forms of writing assess- 
ment and concluded that a six-factor model, corresponding roughly to the 
different forms of writing assessment included, provided the best fit to the 
data. By contrast, Ward, Dupree, and Carlson (1987), applying similar statis- 
tical analyses to reading comprehension tests, found only weak evidence for 
unique components, based either on format or cognitive demands. 

3. New information could be thought of as more psychological than statistical 
in character. For example, additional items or item types could provide us 
with absolutely no new statistical information (i.e., neither an increase in the 
precision of test information about a given dimension nor any information 
about a new dimension) but still give us greater confidence that we had mea- 
sured the domain of interest adequately. This added confidence could stem 
from one of two sources, both of which are conceptual rather than technical. 

In the first instance, trust is at issue. Suppose we have a complex domain, 
such as reading comprehension, which is known to have several aspects or 
dimensions, such as literal, inferential and critical stances toward a text. A 
small (representative) sample of items from the domain might leave us with 
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the concern that not all of the aspects of the domain had been adequately 
represented, even though it provided as reliable an estimate of domain ability 
as a much larger sample would show. Increasing the sample of items would 
contribute to a greater sense of “trust” without any increase in technical 
information. In the second instance, the cognitive demands of the tasks are 
at issue. Snow ( 1993) makes this point vividly in commenting about the rela- 
tionship or lack of relationship between psychometric and psychological 
equivalence: 

... a constructed-response test and a multiple-choice test 
may correlate in some student population about as high as 
their respective reliabilities will allow; this fact may permit 
the two tests to be considered psychometrically equivalent 
for use in rank ordering students . . . But the two are not 
psychologically equivalent; we only act as if they were . . . 

(p. 46). 

Implicit in Snow’s assertion is an assumption that in order for two tests (or 
items) to he psychologically equivalent, they must engage test-takers in the 
same cognitive and/or affective processes. The discontinuity between 
psychometric and psychological constructs is not a trivial matter. Ultimately 
it is a question of constmct validity, and, as such, involves an examination of 
the validity of the original framework used to characterize the constmct, the 
practices used to translate the framework into test items, and the uses of 
information obtained from the test. 

4- New information could arise from taking a second perspective on data or 

tasks that had already been examined from a different perspective. For exam- 
ple, suppose an essay written for a social studies, science, or mathematics per- 
formance was examined initially with a rubric designed to calibrate the 
amount of specific, subject matter-learning demonstrated by students. On a 
second pass, the same essay could be examined with a writing rubric measur- 
ing accomplishment on writing dimensions such as voice, power, audience, 
and mastery of conventions. The additional data generated by the second 
scoring could be “useful” and provide evidence for making an additional deci- 
sion even if it were not statistically independent of data from the first scoring. 
For example, in the work of New Standards (Myers & Pearson, 1996), stu- 
dents participated in three- or four-day integrated language arts tasks in 
which they read texts, responded to them, discussed issues in groups, and, 
finally, wrote in response to specified prompts. The culminating writing 
responses were scored first for writing power and then for evidence of stu- 
dents’ ability to synthesize information from text. Similarly, the Maryland 
performance assessment (1992) allows for a single set of responses to be 
scored initially for science, mathematics, or social studies and later for either 
writing prowess or reading comprehension. A variant of the second scoring 
approach is the tradition of dimensional or primary trait scoring currently- 
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popular in writing assessments. The idea is to make several passes through 
each constructed-response item or paper, each time scoring it holistically 
(i.e., considering the whole essay) but for a different dimension. 

5 . Information could be improved in a pragmatic sense without any increase in 
the quality of the psychometric information or even an increase in the quab 
ity of information about the underlying psychological construct. New infor- 
mation could provide us with a new perspective on a particular decision. For 
example, in analyzing the results of a mixed (performance and multi- 
pie-choice) battery of geography assessments, Kon and Martin-Kniep (1992) 
wanted to know, “Do we learn things about students’ knowledge of geography 
and ability to use geography skills that we would not learn using other testing 
methods?” Their criterion for learning new things derived from common 
sense (is this a new type of task or activity?) and weak statistical comparisons 
(are the two types of items reasonably unrelated, as indexed by correlation 
coefficients?). In a study of mathematics performance cited earlier, Bridge- 
man (1992) concluded that while providing no new information about over- 
all performance in the domain (there was a high degree of common 
variance), the gridded items could provide readily information about specific 
skills within the domain or about dominant error patterns (should anyone 
care to create subscale scores or conduct error analyses). The arena for deter- 
mining whether a test provides new or better information by this criterion is 
neither psychometric nor psychological analysis; it can only be assessed 
within a context, usually instructional, in which people use a variety of 
assessment information to make real decisions of consequence for others. It 
might turn out, for example, that constructed-response items, or the factor on 
which they load, are correlated at unity with multiple-choice items, but so 
dramatically increase the confidence and/or quality of decisions that they 
constitute an indispensable piece of a broader “assessment system.” In every- 
day parlance, this might be equivalent to feeling confident that the individu- 
als who were being assessed both possessed and could apply certain domains of 
knowledge. 


The Information Value of Performance Assessments 

Within the corpus of wide-scale research on the constructed-response/ multiple-choice 
relationship, the question of interest has been whether constructed-response items, when 
used with multiple-choice items in a mixed format package, add “value” (i.e., additional 
information) to processes of understanding, reporting, and using assessment results. For 
our purposes, it is more informative to decompose the overall question into a two part 
question: a) do constructed-response items provide us with more information about what 
students are capable of doing than we would get from multiple-choice items alone, and, if 
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so, b) what types of skills are tapped by the constructed-response items that are not 
measured by multiple-choice items? 

Much of the archival literature evaluating the value added of performance items arises 
from analyses of data from the College Board’s Advanced Placement (AP) exams. This 
should not be surprising since the AP exams have a long history of using both sorts of 
items, presumably because designers and users believe that not all higher order skills can 
he assessed easily with multiple-choice items. 

Working within an IRT framework, Lukhele, Thissen, and Wainer (1994) found that 
constructed-response items on the test provided little information beyond that which 
multiple-choice items yielded. This result was the same for both the chemistry and U.S. 
history exams analyzed in that study. In fact, for the chemistry test, twice as much 
information was yielded from 16 multiple-choice items compared to one constructed- 
response item when a three -parameter logistic IRT model was utilized. Of course, given 
what we know about the influence of test length (as indexed by the number of items) on 
information value (e.g., Hambleton & Swaminathan, 1985), it should not be surprising 
that a 16-item test provides more information than a one-item test. Nevertheless, when 
test length is indexed by total testing time, the 16 multiple-choice items are equivalent 
in length to the one constructed-response item and cost much less to score, indicating that 
from a cost-effectiveness perspective, the multiple-choice items are preferable. 

Additional studies involving AP exams in chemistry, science, and computer science 
(Thissen, Wainer, & Wang, 1994; Wainer & Thissen, 1993; Wainer, Wang, & Thissen, 
1991; Wang, Wainer, & Thissen, 1993) have further explored the value added question. 
The evidence from all of these studies suggests that when data from performance items 
are combined with data from multiple-choice items, little new information about the skills 
being assessed is added. 

In other content areas, however, some evidence to the contrary has been uncovered. In 
the domain of writing, Werts et al. (1980), working with first-year college students, 
attempted to determine whether different item formats on six tests would uncover 
different writing traits. The design was a variation of multitrait-multimethod. The 
instruments included three administrations of the Test of Standard Written English 
(TSWE) and three, short (20 minute) essay prompts. All of these tests were given within 
the same year. The nonzero covariation among the essay residuals showed that the essays 
measured some common trait that was different from whatever traits the essays and 
TSWE shared. One positive aspect to this study was the independence of the three essay 
items in that they were obtained from three different writing occasions. In the Ackerman 
and Smith (1988) study reported earlier in this report, evidence of unique format 
contributions also were found. 

Bennett et al (1990) used confirmatory factor analysis to examine the infrastructure of 
the AP computer science examination. Each constructed-response item was treated as a 
separate variable; groups of ten or more multiple-choice items were formed, with each 
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group representing a separate variable. A one factor covariance structure model was used 
to analyze the data. The results indicated that both item formats measured the same 
characteristics, suggesting that the addition of the constructed-response items did not 
provide any additional information about the trait under investigation. In a later study by 
Bennett et al (1991), however, which used a structure hypothesizing separate format 
factors, the investigators found that the disattenuated correlation coefficients were 
significantly (though not substantially) different from unity. 

Even though added information for mixed item formats has been demonstrated 
occasionally (Ackerman & Smith, 1988; Werts, et al, 1980), it has not been possible to 
specify the different characteristics or skills measured by different item formats. Further, 
the evidence is generally shaky because of methodological weaknesses in the research. For 
example, some constructed-response items were generated through simple 
transformations of existing multiple-choice items rather than through any careful analysis 
of the types of tasks that might best he measured in the constructed response format. Also, 
the small number of constructed response items used in this work plays havoc with the 
interpretation of conventional statistical analyses (Mazzeo, Yamamote, & Kulick, 1993). 
Finally, even when evidence indicates that different item formats appear to assess 
psychometrically distinct traits (e.g., in the Ackerman and Smith study), without 
knowing precisely what the item types actually assess we cannot say whether they truly 
measure different traits or simply measure the same underlying traits differently. This 
concern points to the importance of having a clear and welbarticulated model of the 
construct in question as a guide to item development and psychometric analysis. 

Issues and problems with factor analytic studies 


Factor analysis has been used in several of the studies that have attempted to differentiate 
and uncover exactly what is being assessed by different formats. While robust with respect 
to many assumptions and issues, factor analysis is not without problems. First, results and 
interpretation of the results may depend on the model selected for the analysis, as well as 
the number of factors included in the model. Traub (1993) found that when a one-factor 
model was used to fit the correlation coefficient matrices, the results showed that the two- 
item formats actually measure the same characteristic. When the same data set was 
analyzed using a two-factor model — one factor for each format — the results indicated that 
the coefficients for the two factors were significantly, albeit only slightly, less than one 
(Traub, in Bennett and Ward, 1993). 

Second, the evidence is equivocal, sometimes even within the same study. Using a 
hierarchical factor model that included a general factor for all items and two orthogonal 
factors for the constructed-response items, Thissen (1994) found that constructed- 
response items produced a statistically significant factor, orthogonal to the general factor, 
suggesting that they measure something uniquely different from the multiple-choice 
items, which loaded almost exclusively on the general factor. The constructed-response 
items also loaded on the general factor; more importantly, their loadings on the general 
factor were usually larger than on the constructed response factors. This pattern of results 


Improving the Information Value of Performance Items 


7 


suggests the constructed-response items are measuring the “same” content as the 
multiple-choice items but are measuring something additional, which may be format. 
These results might be construed as evidence that what is being assessed by either item 
format is probably being assessed poorly; furthermore, we do not know for certain what 
constitutes the “thing” that is being measured differently be each item format. 

Also problematic is the consistent difference in the difficulty level of multiple-choice and 
constructed-response items. Thus, what appears to he construct unequivalence may be 
nothing more than difficulty loadings. These concerns point to the importance of basing 
psychometric analyses on welbarticulated theories of the constructs being measured. 
Otherwise, when the factor analysis suggests separate factors, it will provide little or no 
guidance about what those factors represent. 

Third, study design can cloud results and interpretation when using factor analysis. The 
most common design flaw in these factor analytic studies is the small number of 
constructed-response items uses for analysis. While small numbers of items do not 
inherently lead to unreliability, the capacity to find additional factors when they really 
exist is compromised by small item samples. The question is whether we are giving the 
constructed response format a chance to demonstrate uniqueness if and when it really 
exists. Also the way the constructed-response question is scored may have an impact on 
results; others things being equal, dichotomous holistic scores are likely to provide less 
information than a set of dimensional scores for the same item. 

Goldstein (1994) points out a fourth problem with traditional exploratory factor analytic 
techniques; this issue has to do with the “reference population.” When more than one 
subpopulation is under study, it is possible for the model to “fit” well in one subgroup but 
not in the other subgroup(s), e.g., ethnic or gender difference studies. The ideal solution 
is to conduct separate analyses for each subpopulation using confirmatory factor analysis; 
however, subgroup analyses can lead to a host of reliability problems if there are small 
samples in each subgroup. This is especially problematic in analyses of race and ethnicity; 
usually the sample size for Caucasian and African American students is adequate, but 
analyses of other ethnic groups are risky at best. 

Refocusing our energies 

Some scholars of assessment (for example, Cronbach, 1988; Nickerson, 1990; Snow, 
1990; 1993) have encouraged us to set aside our psychometric and methodological 
approaches to studying validity issues and turn our attention instead to the constructs 
underlying the assessment enterprise and to a “psychology of test design” (Snow, 1993). 
Snow, in particular, has encouraged us to expand our “typical” way of thinking about 
construct validity so that we can escape the boundaries of our current paradigms. 

The goal of research stemming from this perspective is to uncover the psychological 
underpinnings of a particular domain and then to relate them to the psychological 
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behaviors involved in answering certain items. Snow also encourages us to expand our 
thinking and research beyond the obvious cognitive demands of assessments: 

Although cognitive analysis of the contrast between constructed-response 
and multiple-choice test formats is essential, so is analysis of the conative 
(i.e., motivational' volitional) and affective aspects of performance that 
connect to the contrast (p. 46). 

The key is to recognize not only that different item formats may elicit different cognitive 
characteristics (structures, to borrow from information processing) or processes which are 
necessary to respond successfully to the different item formats, but that the interpretation 
of scores also depends on whether different item formats elicit unique motivation, effort, 
or consequential behaviors or dispositions. For example, consider the following questions: 

• Do students exert the same amount of mental effort when they answer a mul- 
tiple-choice item as when they answer a constructed-response item? 

• In comparison with the panic that arises when students encounter a construct- 
ed response item for which they have absolutely no idea of even where to be- 
gin, are students more likely to take a risk and guess at the answer to a 
multiple-choice item in the belief that they have some non-zero chance of 
succeeding? 

• Does confidence in selecting (or even guessing at) an answer differ markedly 
from confidence in one’s ability to express oneself in writing? 

Snow (1993) goes on to propose an approach for investigating whether affective and 
conative characteristics are at play, along with cognitive processes. Building on the work 
of Cronbach (1988), he suggests that we adopt a “rival hypothesis” approach to studying 
these issues. To quote Cronbach: 

The advice [to pursue rival hypotheses] is not merely to be on the lookout 
for cases your hypothesis does not fit. The advice is to find, either in the 
relevant community of concerned persons or in your own devilish 
imagination, an alternative explanation of the accumulated findings; then 
to devise a study where the alternatives lead to disparate predictions. 

(P- H) 

Then, of course, one would collect the data that would allow evaluation of the rivals. 
Snow suggests an additional protective guideline — that the investigator word the 
hypothesis in a way that contradicts his or her personal belief(s). For example, an 
advocate of constructed-response format would phrase the hypothesis to favor a 
multiple-choice format, and vice-versa. 


New Initiatives 

Even though there exists a small corpus of careful studies that allow us to examine the 
relationship between multiple-choice and constructed-response items, we still have a 
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great deal to learn. Much of the problem in interpreting the current set of studies is that 
the research has been more opportunistic than intentional. In the prototypic study, 
researchers take advantage of the fact that an existing test or battery happens to have 
included both constructed-response and multiple-choice formats. Much rarer are studies 
in which the researchers have set out to design, from scratch, studies that have, as their 
expressed purpose, the evaluation of both the underlying constructs and the validity of 
the test(s) designed to measure those constructs. Messick (1993, p.64) recognized this 
problem in commenting upon our tendency, whether we come from a psychometric or a 
psychological perspective, to work from what we have rather than from a concept of what 
we want to learn: “However, both perspectives tend to rely too heavily on the construct 
analysis of existing tests, whether by factor analysis or task analysis, rather than focusing 
on theories of the construct domain as a guide to designing construct relevant tests.” 

What is needed is a fresh examination of the relationships between multiple-choice and 
constructed-response items — an examination that begins with an explication of a theory 
of the domain being assessed, which is then transformed into a theory of achievement 
within the domain. Such a theory would ultimately have to extend beyond the domain 
into more generic measurement issues (e.g., item format, test length, assessment context) 
and motivational issues (e.g., stakes, examinee preparation and anxiety) in order to test 
predictions about the nature and strength of relationships among components and/or 
between each component and some external criterion measure of the domain. This is, of 
course, exactly the sort of activity that psychometricians have had in mind for years in 
discussing the construct validation of tests (Messick, 1989; Cronbach, 1971). It is also, an 
activity that, for a variety of reasons, has escaped our attention or exceeded our capacity; 
we seldom approach that ideal. Nevertheless, short of a complete evaluation of the 
construct in question, there are a number of useful but less ambitious initiatives that 
would allow us to answer the question of value added for performance items with greater 
assurance than is currently possible. As first steps at working toward building this theory, 
we close this paper by sketching out, in broad terms, a set of studies that should provide us 
with needed information about the value added of performance items in mixed format 
assessments such as NAEP. 

The cognitive demands of multiple-choice and constructed-response items 


Even if it were to be demonstrated that comparable sets of multiple-choice and 
constructed-response items were psychometrically equivalent, it does not follow that they 
are psychologically equivalent, i.e., that they are variant indices of the same underlying 
construct; high degrees of shared variance could stem from any number of circumstances. 
We need studies in which we examine psychological equivalence directly. Even better 
would be studies in which we examine psychometric and psychological equivalence for a 
common set of items. The most productive tool for determining the psychological 
processes elicited by test items is the think-aloud procedure, which has been used with 
NAEP items in some previous work (Yepes-Bayara, 1996; Campbell, 1996). Participants 
could be asked to share their step-by-step thinking as they attempt to select or construct 
responses to test items. Although think-aloud methodology appears promising for this 
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sort of initiative, it is by no means the only index of cognitive functioning that we should 
consider. When tasks involve text reading and response, both eye-movement 
methodology and computer controlled text search (look-back) methodology could tell us a 
great deal about the influence of item format on the role of text in selecting/ 
constructing responses. 


The value of different types of items in educational decision-making 

The question of whether different formats provide “additional information” is as much a 
question of “consumer” (i.e., test user) perception as it is psychometric independence. We 
need to know whether the addition of information from performance assessment provides 
new information from the perspective of those who use the information for making decisions, 
either about individuals or groups (in the case of NAEP, only decisions about groups are 
relevant). Ultimately, the question of whether constructed-response assessments possess 
interpretive value is an empirical question, and deserves to be answered empirically by 
observing the uses to which test users put the information they receive. We propose a 
study in which subject matter specialists are provided with tailored reports of NAEP 
results and asked to draw conclusions about student performance in their subject area (or, 
in the case of state-by-state comparisons, about the relative distribution of achievement). 
Within such a context, systematic variation in the nature of the information provided 
could be introduced (only multiple-choice, only constructed-response, or both). We 
would examine the influence of different information packages on the nature of the 
conclusions that these specialist draw about aggregate performance and the suggestions 
they make about changes in curricular or instructional policy. 


Rubric research 

The NAEP rubrics for reading have been roundly criticized by two separate evaluation 
panels. They are viewed as too quantitative and only marginally related to the NAEP 
framework for reading (DeStefano, Pearson, & Afflerbach, 1997). High dividends might 
result from a modest investment in creating new rubrics that are driven by the framework 
and then comparing the quality of information, both psychometrically and pragmatically, 
received when items are scored by these rubrics in contrast to the conventional rubrics. In 
another vein, we might examine the conceptual genesis of rubrics, paralleling 
Fredericksen’s (1984) questions about whether transforming multiple-choice items into 
performance items is the same as transforming performance items into multiple-choice. 
Suppose the rubrics for a set of constructed-response items are based upon the same 
conception of underlying dimensions (the psychological construct) as were used to guide 
the development of a comparable set of multiple-choice items. Such a practice might, in 
fact, be reasonable if our goal is to examine trait equivalence across item formats; 
however, this practice can also constrain our thinking about the range of possible traits 
that might be assessed with the constructed-response format and teased out by an 
appropriate rubric. In other words, in achieving control for conceptual equivalence, we 
might be losing our capacity to uncover a larger set of possible dimensions of the 
construct that can only be tapped by the constructed-response format. This issue could be 
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addressed in a study in which competing rubrics were developed and used to score a 
common set of constructed-response items. The first rubric would be developed using a 
framework that had been used to generate multiple-choice items and then extended to 
constructed-response items. The second rubric would result from a fresh perspective: 
subject matter experts would he asked to generate a framework and related rubrics for an 
assessment system consisting only of constructed-response items. The question of interest 
is whether the two rubrics would yield equivalent scores and or trait information. 


Truth in advertising 

Many educators, and ordinary citizens find it odd that students are unaware of the criteria 
by which their extended constructed responses will be evaluated. It would be useful to 
know, even in a drop-out-of-the-sky assessment such as NAEP, whether students perform 
better when they know what is expected of them. We would easily embed an 
experimental form into NAEP; in such a form, a student version of the NAEP rubric 
would he attached to every extended constructed-response item, perhaps with an initial 
introduction to the importance of reviewing these rubrics before answering the questions. 


Double scoring 

Since reading (as a medium of information delivery) and writing (as a mode of response) 
are often employed in assessments of other subject matters (especially social studies and 
science, and, to a lesser extent, mathematics), the possibility arises that we might achieve 
greater cost-effectiveness by selecting some constructed-response items for double scoring. 
As suggested earlier, there is precedent for this practice in the work of the state of 
Maryland and the New Standards project, and there has been interest in this possibility 
within NCES (White, 1997). One possibility is to take advantage of already existing 
items which seem to lend themselves to double scoring. The cost advantages here are 
clear; in fact, the documents required to conduct the study (student test forms) may 
already exist within NAEP archives. A second possibility starts with the assumption that 
double-scoring is likely to be more effective if the items are initially constructed with that 
purpose in mind. In other words, if item writers knew that an item would be scored for 
both history and reading, both mathematics and writing, or both reading and writing, 
they might construct it differently from the way they would for single-subject scoring. A 
remote, but intriguing possibility would be to compare the efficacy of double-scoring 
using items that have been serendipitously generated versus those generated 
intentionally. In either situation, it would be important to examine carefully the 
cost-effectiveness of double-scoring. If, as some other studies suggest, scoring costs dwarf 
item-development and administration costs, then double-scoring may offer little or no 
overall cost savings. 


Dimensional/opportunistic scoring 

There is a core body of research, mainly in writing assessment, which examines the 
efficacy of scoring items or papers separately for two or more dimensions within a domain. 
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That research needs to be extended into other subject matter domains to determine 
whether constructed-responses are capable of yielding separate, perhaps even 
independent, estimates of performance on different aspects of a subject matter. In 
mathematics, for example, it is possible that a given response could be scored once for 
computational accuracy, a second time for evidence of problem-solving, and a third time 
for communication prowess. In reading, could a single response yield independent 
evidence of more than one of the basic stances in the NAEP framework: initial 
understanding, developing interpretation, personal response, or critical stance? 
Heretofore, we have tended to classify items independent of the responses they yield, or, 
in the case of reading, we have classified each level of the rubric for a given item into a 
given stance category (a level 1 response is initial understanding while a level 3 or 4 is 
critical stance). We could develop a more opportunistic scoring system for 
constructed-response items, one that was capable of taking advantage of and giving credit 
for evidence of performance on a given dimension wherever and whenever it emerged, 
even if it emerged in situations where the item developers were least expecting it. 


The biasing effect of surface features of language 

Constructed response items put a premium on the expressive, especially the writing, 
abilities of students. In a writing assessment, scoring procedures that reward clear 
expression and that privilege dominant (i.e., standardized or conventional) forms of 
English is both understandable and even desirable. But what role, if any, should clarity 
and conventionality of expression play when writing is used as the vehicle, as the 
response format, to get at other knowledge and skill outcomes? More importantly, do 
constructed-response items have a depressing effect on the scores of students whose 
primary language is not standard English? For example, given responses that are identical 
in content but which differ systematically in their adherence to standard English, will the 
responses receive the same scores? If bias exists in unspecified scoring procedures, can it 
be reduced or eliminated through the use of clear scoring guidelines (remember, it is the 
ideas, not the quality or clarity of the expression, that count) and benchmark papers 
(high scoring exemplars which do not adhere to standard English). On the face of it, such 
a study may not seem to fit under the “quality of information” rubric, yet closer analysis 
suggests that quality of information about particular populations is exactly what is at stake 
here. We could easily embed such a study in a regular NAEP scoring conference. We 
could devise experimental responses in which we systematically altered adherence to 
features of standard English while holding content constant, and we could then 
determine how the competing versions fared during a normal NAEP scoring session. In 
the event that NAEP contractors balked at such an experiment (most likely on the 
ethical principle that it is deceptive vis-a-vis scorers), the study would have to be 
contracted out. In either case, it seems critical to know whether this sort of bias is present 
within our scoring systems. This question seems relevant to constructed-response items in 
all areas save writing. 
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The role of passage difficulty in reading assessment 

The issue of passage difficulty in reading, particularly its potentially depressing effect on 
performance of students at the lower end of the performance continuum, has been 
emphasized in a number of recent reports (e.g., Forgione, 1996; DeStefano et al, 1997; 
NAE, 1997), and concerned scholars and policy makers have called for the production of 
easier blocks of NAEP reading items so that low-achieving students can at least “make it 
onto the scale,” or in the language of information value, so that we possess more 
information about the performance of low-achieving students. If these more “accessible” 
blocks are created, and if we are thoughtful about how we design and generate items 
across blocks, we have an opportunity to determine whether response format 
(multiple-choice versus constructed-response) or passage difficulty (or some unique 
combination of the two) is responsible for the current low information yields of 
constructed-response items. It might be, for example, that students have a lot more to say 
when it is relatively easy for them to read, digest, think about, and even critique the texts 
they encounter. It might also turn out that difficulty interacts with achievement level in 
such a way that easy passages provide opportunities for low-achieving students to shine 
whereas hard passages provide just the challenge that high-achieving students need to get 
involved in the assessment. 


The prospect of differential outcomes for different populations 

We have not emphasized the importance of imposing an individual differences filter on 
any or all of the studies outlined thus far. We are aware of the importance of determining 
how each of these questions would be answered for different groups of students; however, 
we are also mindful of cost factors associated with increases in sample size and the 
reliability threats that arise when ample subsamples are not used. However, we are open 
to the possibility that one or another of these initiatives cries out for partitioning samples 
on some well-reasoned basis. 


Prior experience 

As with different populations, prior experience could serve as a filter or an independent 
variable in several of the studies outlined earlier. Prior experience has two possible 
realizations, one at the classroom/school level and one at the individual level. At the 
classroom/school level, it is instantiated as instructional experience (opportunity to 
learn). If we can he sure of the type of curricular emphases different students have 
experienced (e.g., emphasis on critical stance or response to literature in reading or 
problem-solving and communication in math), and if we can locate populations with 
different curricular histories, we can test constructed-response item performance under 
both optimal and suboptimal conditions: Do students who have learned what the items 
are designed to measure perform at high levels compared to students who have received 
other curricular emphases? It would be interesting, for example, to conduct the 
think-aloud study described earlier in sites which exhibit just such a curricular contrast. 
At the individual level, of course, prior experience is instantiated as prior knowledge, the 
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impact of which is welhdocumented in reading and writing assessment. As with the 
passage difficulty issue discussed earlier, it would be useful to know whether students 
provide more elaborate and more sophisticated responses to constructed-response 
prompts when they are quite knowledgeable about the topic at hand. To answer this 
question, we would have to obtain an independent measure of topical knowledge for the 
subject passage (in the case of reading), prompt (in the case of writing), or task (in the 
case of science or math). 


Transforming items across formats 

When evaluating the equivalence of constructed-response and multiple-choice items, 
researchers sometimes begin with one set of items, say multiple-choice, and rewrite them 
as constructed-response, or vice-versa. In this way, they attempt to control the content 
and focus of the items across formats. In other studies, there is no attempt to control for 
content and focus; instead researchers take advantage of the fact that an existing test 
happens to contain some multiple-choice and some constructed-response items. What we 
need are studies in which both multiple-choice and constructed-response items are 
developed in ways that allow each to “put their best foot forward.” To our knowledge, 
Fredericksen ( 1984) is one of the few researchers to consider the possibility that we may 
be introducing a source of bias when, for example, constructed-response items are 
generated by transforming an existing set of multiple-choice items. He also is one of the 
few researchers to develop multiple-choice items from an existing set of constructed- 
response items. This study would extend the logic of his dual source approach to item 
generation. This could be accomplished with a procedure along these lines: 

• Identify a domain of interest, such as reading comprehension, response to lit- 
erature, mathematical power, etc. 

• Identify one group of item writers with reputations for developing first-rate 
multiple-choice items; identify a second group with equally strong reputations 
for constructed-response items. 

• Set each group to work on developing a set of items for the domain of interest. 

• When each group is finished, ask them to exchange item groups and, as best 
they can, transform each multiple-choice item into a constructed-response 
item and vice-versa. 

• Create matched item sets, balanced for content and format. 

• Administer to students, and evaluate relationships between constructed-re- 
sponse and multiple-choice item subsets. 

While a study of this magnitude, both in terms of item development and requisite sample 
size, would be expensive if conducted as an independent, stand-alone initiative, it could 
easily be integrated into the item tryouts conducted as a matter of course in the NAEP 
item development process. Furthermore, it seems to lend itself to several content 
domains, and the question is timely, where timeliness is indexed by conflict, tension, and 
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interest within the field, in math, science, social studies, and reading. Moreover, with 
appropriate design manipulations, it may even be possible to address the tension that 
continues to haunt subject specialists: Why do we continue to use psychometric tools, 
such as IRT, which assume unidimensionality when we believe that the domain is 
inherently multidimensional? 


Concluding Remarks 

We believe that the question of what constructed-response items add to the 
interpretation of NAEP results, although important in any era, is critical at this point in 
NAEP’s history. The critics are lining up to recommend that constructed-response items 
be eliminated in the name of economy and precision. Until, and unless, it can be 
demonstrated that there are grounds, whether psychometric, conceptual, or pragmatic, for 
maintaining or expanding the emphasis on constructed-response items, their use will be 
questioned, and questionable. Consequently, we believe that NCES and the National 
Assessment Governing Board (NAGB), as the custodians of NAEP’s efficacy and 
credibility, should do everything possible to see that studies such as those proposed in this 
concept paper are carried out with all deliberate speed. If programs of research such as 
these are not undertaken, then the decisions about the relative emphases on different 
item formats will be based upon considered opinion or political expediencies. Our 
nation’s assessment centerpiece deserves a better hearing, one in which evidence is 
paramount. 
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Peter Stowe 
Steven Kaufman 
Peter Stowe 


Dawn Nelson 
Dawn Nelson 
Jeffrey Owings 

Jeffrey Owings 


Samuel Peng 
Dawn Nelson 
Dawn Nelson 
Marilyn Seastrom 


William J. Fowler, Jr. 
William J. Fowler, Jr. 
William J. Fowler, Jr. 
William J. Fowler, Jr. 
William J. Fowler, Jr. 



No. Title 

2002-04 Improving Consistency of Response Categories Across NCES Surveys 

National Assessment of Educational Progress (NAEP) 

95-12 Rural Education Data User’s Guide 

97-29 Can State Assessment Data be Used to Reduce State NAEP Sample Sizes? 

97-30 ACT's NAEP Redesign Project: Assessment Design is the Key to Useful and Stable 
Assessment Results 

97-31 NAEP Reconfigured: An Integrated Redesign of the National Assessment of Educational 
Progress 

97-32 Innovative Solutions to Intractable Large Scale Assessment (Problem 2: Background 
Questionnaires) 

97- 37 Optimal Rating Procedures and Methodology for NAEP Open-ended Items 

97^14 Development of a SASS 1993-94 School-Level Student Achievement Subfile: Using 

State Assessments and State NAEP, Feasibility Study 

98- 15 Development of a Prototype System for Accessing Linked NCES Data 

1999-05 Procedures Guide for Transcript Studies 

1999-06 1998 Revision of the Secondary School Taxonomy 

2001-07 A Comparison of the National Assessment of Educational Progress (NAEP), the Third 

International Mathematics and Science Study Repeat (TIMSS-R), and the Programme 
for International Student Assessment (PISA) 

2001-08 Assessing the Lexile Framework: Results of a Panel Meeting 
2001-1 1 Impact of Selected Background Variables on Students’ NAEP Math Performance 
2001-13 The Effects of Accommodations on the Assessment of LEP Students in NAEP 
2001-19 The Measurement of Home Background Indicators: Cognitive Laboratory Investigations 
of the Responses of Fourth and Eighth Graders to Questionnaire Items and Parental 
Assessment of the Invasiveness of These Items 


NCES contact 
Marilyn Seastrom 


Samuel Peng 
Steven Gorman 

Steven Gorman 

Steven Gorman 

Steven Gorman 

Steven Gorman 
Michael Ross 

Steven Kaufman 
Dawn Nelson 
Dawn Nelson 
Arnold Goldstein 


Sheida White 
Arnold Goldstein 
Arnold Goldstein 
Arnold Goldstein 



No. Title 

2002-04 Improving Consistency of Response Categories Across NCES Surveys 
2002-06 The Measurement of Instructional Background Indicators: Cognitive Laboratory 

Investigations of the Responses of Fourth and Eighth Grade Students and Teachers to 
Questionnaire Items 

2002- 07 Teacher Quality, School Context, and Student Race/Ethnicity: Findings from the Eighth 

Grade National Assessment of Educational Progress 2000 Mathematics Assessment 

2003- 06 NAEP Validity Studies: The Validity of Oral Accommodation in Testing 

2003-07 NAEP Validity Studies: An Agenda for NAEP Validity Research 

2003-08 NAEP Validity Studies: Improving the Information Value of Performance Items 
in Large Scale Assessments 

2003-09 NAEP Validity Studies: Optimizing State NAEP: Issues and Possible 
Improvements 

2003-10 A Content Comparison of the NAEP and PIRLS Fourth-Grade Reading Assessments 
2003-11 NAEP Validity Studies: Reporting the Results of the National Assessment of 
Educational Progress 

2003-12 NAEP Validity Studies: An Investigation of Why Students Do Not Respond to 
Questions 

2003-13 NAEP Validity Studies: A Study of Equating in NAEP 

2003-14 NAEP Validity Studies: Feasibility Studies of Two-Stage Testing in Large-Scale 
Educational Assessment: Implications for NAEP 
2003-15 NAEP Validity Studies: Computer Use and Its Relation to Academic 
Achievement in Mathematics, Reading, and Writing 
2003-16 NAEP Validity Studies: Implications of Electronic Technology for the NAEP 
Assessment 

2003-17 NAEP Validity Studies: The Effects of Finite Sampling on State Assessment 
Sample Requirements 

National Education Longitudinal Study of 1988 (NELS:88) 

95-04 National Education Longitudinal Study of 1988: Second Follow-up Questionnaire Content 
Areas and Research Issues 

95-05 National Education Longitudinal Study of 1988: Conducting Trend Analyses of NLS-72, 
HS&B. and NELS:88 Seniors 

95-06 National Education Longitudinal Study of 1988: Conducting Cross-Cohort Comparisons 
Using HS&B. NAEP, and NELS:88 Academic Transcript Data 
95-07 National Education Longitudinal Study of 1988: Conducting Trend Analyses HS&B and 
NELS:88 Sophomore Cohort Dropouts 
95-12 Rural Education Data User’s Guide 

95- 14 Empirical Evaluation of Social, Psychological, & Educational Construct Variables Used 

in NCES Surveys 

96- 03 National Education Longitudinal Study of 1988 (NELS:88) Research Framework and 

Issues 

98-06 National Education Longitudinal Study of 1988 (NELS:88) Base Year through Second 
Follow-Up: Final Methodology Report 

98-09 High School Curriculum Structure: Effects on Coursetaking and Achievement in 

Mathematics for High School Graduates — An Examination of Data from the National 
Education Longitudinal Study of 1988 

98-15 Development of a Prototype System for Accessing Linked NCES Data 
1999-05 Procedures Guide for Transcript Studies 
1999-06 1998 Revision of the Secondary School Taxonomy 

1999-15 Projected Postsecondary Outcomes of 1992 High School Graduates 

2001- 16 Imputation of Test Scores in the National Education Longitudinal Study of 1988 

2002- 04 Improving Consistency of Response Categories Across NCES Surveys 

2003- 01 Mathematics, Foreign Language, and Science Coursetaking and the NELS:88 Transcript 

Data 

2003-02 English Coursetaking and the NELS:88 Transcript Data 

National Household Education Survey (NHES) 

95- 12 Rural Education Data User’s Guide 

96- 13 Estimation of Response Bias in the NHES:95 Adult Education Survey 

96-14 The 1995 National Household Education Survey: Reinterview Results for the Adult 
Education Component 


NCES contact 
Marilyn Seastrom 
Arnold Goldstein 


Janis Brown 

Patricia Dabbs 
Patricia Dabbs 
Patricia Dabbs 

Patricia Dabbs 

Marilyn Binkley 
Patricia Dabbs 

Patricia Dabbs 

Patricia Dabbs 
Patricia Dabbs 

Patricia Dabbs 

Patricia Dabbs 

Patricia Dabbs 


Jeffrey Owings 

Jeffrey Owings 

Jeffrey Owings 

Jeffrey Owings 

Samuel Peng 
Samuel Peng 

Jeffrey Owings 

Ralph Lee 

Jeffrey Owings 

Steven Kaufman 
Dawn Nelson 
Dawn Nelson 
Aurora D’Amico 
Ralph Lee 
Marilyn Seastrom 
Jeffrey Owings 

Jeffrey Owings 


Samuel Peng 
Steven Kaufman 
Steven Kaufman 



No. Title 

96-20 1991 National Household Education Survey (NHES:91) Questionnaires: Screener. Early 

Childhood Education, and Adult Education 

96-21 1993 National Household Education Survey (NHES:93) Questionnaires: Screener, School 

Readiness, and School Safety and Discipline 

96-22 1995 National Household Education Survey (NHES:95) Questionnaires: Screener, Early 

Childhood Program Participation, and Adult Education 
96-29 Undercoverage Bias in Estimates of Characteristics of Adults and 0- to 2- Year-Olds in the 
1995 National Household Education Survey (NHES:95) 

96- 30 Comparison of Estimates from the 1995 National Household Education Survey 

(NHES:95) 

97- 02 Telephone Coverage Bias and Recorded Interviews in the 1993 National Household 

Education Survey (NHES:93) 

97-03 1991 and 1995 National Household Education Survey Questionnaires: NHES:91 Screener, 

NHES:91 Adult Education, NHES:95 Basic Screener, and NHES:95 Adult Education 

97-04 Design, Data Collection, Monitoring, Interview Administration Time, and Data Editing in 
the 1993 National Household Education Survey (NHES:93) 

97-05 Unit and Item Response, Weighting, and Imputation Procedures in the 1993 National 
Household Education Survey (NHES:93) 

97-06 Unit and Item Response, Weighting, and Imputation Procedures in the 1995 National 
Household Education Survey (NHES:95) 

97-08 Design, Data Collection, Interview Timing, and Data Editing in the 1995 National 
Household Education Survey 

97-19 National Household Education Survey of 1995: Adult Education Course Coding Manual 

97-20 National Household Education Survey of 1995: Adult Education Course Code Merge 

Files User’s Guide 

97-25 1996 National Household Education Survey (NHES:96) Questionnaires: 

Screener/Household and Library, Parent and Family Involvement in Education and 
Civic Involvement, Youth Civic Involvement, and Adult Civic Involvement 
97-28 Comparison of Estimates in the 1996 National Household Education Survey 
97-34 Comparison of Estimates from the 1993 National Household Education Survey 
97-35 Design, Data Collection, Interview Administration Time, and Data Editing in the 1996 
National Household Education Survey 

97-38 Reinterview Results for the Parent and Youth Components of the 1996 National 
Household Education Survey 

97- 39 Undercoverage Bias in Estimates of Characteristics of Households and Adults in the 1996 

National Household Education Survey 

97^10 Unit and Item Response Rates, Weighting, and Imputation Procedures in the 1996 
National Household Education Survey 

98- 03 Adult Education in the 1990s: A Report on the 1991 National Household Education 

Survey 

98-10 Adult Education Participation Decisions and Barriers: Review of Conceptual Frameworks 
and Empirical Studies 

2002-04 Improving Consistency of Response Categories Across NCES Surveys 

National Longitudinal Study of the High School Class of 1972 (NLS-72) 

95- 12 Rural Education Data User’s Guide 

2002-04 Improving Consistency of Response Categories Across NCES Surveys 

National Postsecondary Student Aid Study (NPSAS) 

96- 17 National Postsecondary Student Aid Study: 1996 Field Test Methodology Report 

2000-17 National Postsecondary Student Aid Study:2000 Field Test Methodology Report 
2002-03 National Postsecondary Student Aid Study, 1999-2000 (NPSAS:2000), CATI 

Nonresponse Bias Analysis Report. 

2002-04 Improving Consistency of Response Categories Across NCES Surveys 

National Study of Postsecondary Faculty (NSOPF) 

97- 26 Strategies for Improving Accuracy of Postsecondary Faculty Lists 

98- 15 Development of a Prototype System for Accessing Linked NCES Data 

2000-01 1999 National Study of Postsecondary Faculty (NSOPF:99) Field Test Report 


NCES contact 
Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Peter Stowe 
Peter Stowe 

Kathryn Chandler 

Kathryn Chandler 
Kathryn Chandler 
Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Kathryn Chandler 

Peter Stowe 

Peter Stowe 

Marilyn Seastrom 

Samuel Peng 
Marilyn Seastrom 

Andrew G. Malizio 
Andrew G. Malizio 
Andrew Malizio 

Marilyn Seastrom 


Linda Zimbler 
Steven Kaufman 
Linda Zimbler 



No. Title 

2002-04 Improving Consistency of Response Categories Across NCES Surveys 

2002- 08 A Profile of Part-time Faculty: Fall 1998 

Postsecondary Education Descriptive Analysis Reports (PEDAR) 

2000-1 1 Financial Aid Profile of Graduate Students in Science and Engineering 

Private School Universe Survey (PSS) 

95-16 Intersurvey Consistency in NCES Private School Surveys 

95- 17 Estimates of Expenditures for Private K-12 Schools 

96- 16 Strategies for Collecting Finance Data from Private Schools 

96-26 Improving the Coverage of Private Elementary-Secondary Schools 

96- 27 Intersurvey Consistency in NCES Private School Surveys for 1993-94 

97- 07 The Determinants of Per-Pupil Expenditures in Private Elementary and Secondary 

Schools: An Exploratory Analysis 

97- 22 Collection of Private School Finance Data: Development of a Questionnaire 

98- 15 Development of a Prototype System for Accessing Linked NCES Data 

2000-04 Selected Papers on Education Surveys: Papers Presented at the 1998 and 1999 ASA and 
1999 AAPOR Meetings 

2000-15 Feasibility Report: School-Level Finance Pretest, Private School Questionnaire 

Progress in International Reading Literacy Study (PIRLS) 

2003- 05 PIRLS-IEA Reading Literacy Framework: Comparative Analysis of the 1991 IEA 

Reading Study and the Progress in International Reading Literacy Study 
2003-10 A Content Comparison of the NAEP and PIRLS Fourth-Grade Reading Assessments 

Recent College Graduates (RCG) 

98-15 Development of a Prototype System for Accessing Linked NCES Data 
2002-04 Improving Consistency of Response Categories Across NCES Surveys 


Schools and 

94-01 

94-02 

94-03 

94-04 

94- 06 

95- 01 

95-02 

95-03 

95-08 

95-09 

95-10 

95-11 

95-12 

95-14 

95-15 

95-16 

95- 18 

96- 01 
96-02 
96-05 


Staffing Survey (SASS) 

Schools and Staffing Survey (SASS) Papers Presented at Meetings of the American 
Statistical Association 

Generalized Variance Estimate for Schools and Staffing Survey (SASS) 

1991 Schools and Staffing Survey (SASS) Reinterview Response Variance Report 
The Accuracy of Teachers’ Self-reports on their Postsecondary Education: Teacher 
Transcript Study, Schools and Staffing Survey 
Six Papers on Teachers from the 1990-91 Schools and Staffing Survey and Other Related 
Surveys 

Schools and Staffing Survey: 1994 Papers Presented at the 1994 Meeting of the American 
Statistical Association 

QED Estimates of the 1990-91 Schools and Staffing Survey: Deriving and Comparing 
QED School Estimates with CCD Estimates 
Schools and Staffing Survey: 1990-91 SASS Cross-Questionnaire Analysis 
CCD Adjustment to the 1990-91 SASS: A Comparison of Estimates 
The Results of the 1993 Teacher List Validation Study (TLVS) 

The Results of the 1991-92 Teacher Follow-up Survey (TFS) Reinterview and Extensive 
Reconciliation 

Measuring Instruction, Curriculum Content, and Instructional Resources: The Status of 
Recent Work 

Rural Education Data User’s Guide 

Empirical Evaluation of Social, Psychological, & Educational Construct Variables Used 
in NCES Surveys 

Classroom Instructional Processes: A Review of Existing Measurement Approaches and 
Their Applicability for the Teacher Follow-up Survey 
Intersurvey Consistency in NCES Private School Surveys 

An Agenda for Research on Teachers and Schools: Revisiting NCES’ Schools and 
Staffing Survey 

Methodological Issues in the Study of Teachers’ Careers: Critical Features of a Truly 
Longitudinal Study 

Schools and Staffing Survey (SASS): 1995 Selected papers presented at the 1995 Meeting 
of the American Statistical Association 

Cognitive Research on the Teacher Listing Form for the Schools and Staffing Survey 


NCES contact 
Marilyn Seastrom 
Linda Zimbler 


Aurora D’Amico 


Steven Kaufman 
Stephen Broughman 
Stephen Broughman 
Steven Kaufman 
Steven Kaufman 
Stephen Broughman 

Stephen Broughman 
Steven Kaufman 
Dan Kasprzyk 

Stephen Broughman 


Laurence Ogle 
Marilyn Binkley 


Steven Kaufman 
Marilyn Seastrom 


Dan Kasprzyk 

Dan Kasprzyk 
Dan Kasprzyk 
Dan Kasprzyk 

Dan Kasprzyk 

Dan Kasprzyk 

Dan Kasprzyk 

Dan Kasprzyk 
Dan Kasprzyk 
Dan Kasprzyk 
Dan Kasprzyk 

Sharon Bobbitt & 
John Ralph 
Samuel Peng 
Samuel Peng 

Sharon Bobbitt 

Steven Kaufman 
Dan Kasprzyk 

Dan Kasprzyk 

Dan Kasprzyk 

Dan Kasprzyk 



No. 

96-06 

96-07 

96-09 

96-10 

96-11 

96-12 

96-15 

96-23 

96-24 

96-25 

96- 28 

97- 01 

97-07 

97-09 

97-10 

97-11 

97-12 

97-14 

97-18 

97-22 

97-23 

97-41 

97^12 

97- 44 

98- 01 
98-02 
98-04 
98-05 

98-08 

98-12 

98-13 

98-14 

98-15 

98-16 

1999-02 

1999-04 

1999-07 

1999-08 

1999-10 

1999-12 

1999-13 

1999-14 

1999- 17 

2000- 04 


Title 

The Schools and Staffing Survey (SASS) for 1998-99: Design Recommendations to 
Inform Broad Education Policy 

Should SASS Measure Instructional Processes and Teacher Effectiveness? 

Making Data Relevant for Policy Discussions: Redesigning the School Administrator 
Questionnaire for the 1998-99 SASS 
1998-99 Schools and Staffing Survey: Issues Related to Survey Depth 
Towards an Organizational Database on America’s Schools: A Proposal for the Future of 
SASS, with comments on School Reform, Governance, and Finance 
Predictors of Retention, Transfer, and Attrition of Special and General Education 
Teachers: Data from the 1989 Teacher Followup Survey 
Nested Structures: District-Level Data in the Schools and Staffing Survey 
Linking Student Data to SASS: Why, When, How 
National Assessments of Teacher Quality 

Measures of Inservice Professional Development: Suggested Items for the 1998-1999 
Schools and Staffing Survey 

Student Learning, Teaching Quality, and Professional Development: Theoretical 

Linkages, Current Measurement, and Recommendations for Future Data Collection 
Selected Papers on Education Surveys: Papers Presented at the 1996 Meeting of the 
American Statistical Association 

The Determinants of Per-Pupil Expenditures in Private Elementary and Secondary 
Schools: An Exploratory Analysis 
Status of Data on Crime and Violence in Schools: Final Report 

Report of Cognitive Research on the Public and Private School Teacher Questionnaires 
for the Schools and Staffing Survey 1993-94 School Year 
International Comparisons of Inservice Professional Development 
Measuring School Reform: Recommendations for Future SASS Data Collection 
Optimal Choice of Periodicities for the Schools and Staffing Survey: Modeling and 
Analysis 

Improving the Mail Return Rates of SASS Surveys: A Review of the Literature 
Collection of Private School Finance Data: Development of a Questionnaire 
Further Cognitive Research on the Schools and Staffing Survey (SASS) Teacher Listing 
Form 

Selected Papers on the Schools and Staffing Survey: Papers Presented at the 1997 Meeting 
of the American Statistical Association 

Improving the Measurement of Staffing Resources at the School Level: The Development 
of Recommendations for NCES for the Schools and Staffing Survey (SASS) 
Development of a SASS 1993-94 School-Level Student Achievement Subfile: Using 
State Assessments and State NAEP, Feasibility Study 
Collection of Public School Expenditure Data: Development of a Questionnaire 
Response Variance in the 1993-94 Schools and Staffing Survey: A Reinterview Report 
Geographic Variations in Public Schools’ Costs 

SASS Documentation: 1993-94 SASS Student Sampling Problems; Solutions for 

Determining the Numerators for the SASS Private School (3B) Second-Stage Factors 
The Redesign of the Schools and Staffing Survey for 1999-2000: A Position Paper 
A Bootstrap Variance Estimator for Systematic PPS Sampling 
Response Variance in the 1994-95 Teacher Follow-up Survey 
Variance Estimation of Imputed Survey Data 

Development of a Prototype System for Accessing Linked NCES Data 
A Feasibility Study of Longitudinal Design for Schools and Staffing Survey 
Tracking Secondary Use of the Schools and Staffing Survey Data: Preliminary Results 
Measuring Teacher Qualifications 

Collection of Resource and Expenditure Data on the Schools and Staffing Survey 
Measuring Classroom Instructional Processes: Using Survey and Case Study Fieldtest 
Results to Improve Item Construction 
What Users Say About Schools and Staffing Survey Publications 
1993-94 Schools and Staffing Survey: Data File User’s Manual, Volume III: Public-Use 
Codebook 

1993- 94 Schools and Staffing Survey: Data File User's Manual, Volume IV: Bureau of 
Indian Affairs (BIA) Restricted-Use Codebook 

1994- 95 Teacher Followup Survey: Data File User’s Manual, Restricted-Use Codebook 
Secondary Use of the Schools and Staffing Survey Data 

Selected Papers on Education Surveys: Papers Presented at the 1998 and 1999 ASA and 
1999 AAPOR Meetings 


NCES contact 
Dan Kasprzyk 

Dan Kasprzyk 
Dan Kasprzyk 

Dan Kasprzyk 
Dan Kasprzyk 

Dan Kasprzyk 

Dan Kasprzyk 
Dan Kasprzyk 
Dan Kasprzyk 
Dan Kasprzyk 

Mary Rollefson 

Dan Kasprzyk 

Stephen Broughman 

Lee Hoffman 
Dan Kasprzyk 

Dan Kasprzyk 
Mary Rollefson 
Steven Kaufman 

Steven Kaufman 
Stephen Broughman 
Dan Kasprzyk 

Steve Kaufman 

Mary Rollefson 

Michael Ross 

Stephen Broughman 
Steven Kaufman 
William J. Fowler, Jr. 
Steven Kaufman 

Dan Kasprzyk 
Steven Kaufman 
Steven Kaufman 
Steven Kaufman 
Steven Kaufman 
Stephen Broughman 
Dan Kasprzyk 
Dan Kasprzyk 
Stephen Broughman 
Dan Kasprzyk 

Dan Kasprzyk 
Kerry Gruber 

Kerry Gruber 

Kerry Gruber 
Susan Wiley 
Dan Kasprzyk 



No. Title 

2000-10 A Research Agenda for the 1999-2000 Schools and Staffing Survey 
2000-13 Non-professional Staff in the Schools and Staffing Survey (SASS) and Common Core of 
Data (CCD) 

2000- 18 Feasibility Report: School-Level Finance Pretest, Public School District Questionnaire 

2002-04 Improving Consistency of Response Categories Across NCES Surveys 

Third International Mathematics and Science Study (TIMSS) 

2001- 01 Cross-National Variation in Educational Preparation for Adulthood: From Early 

Adolescence to Young Adulthood 

2001-05 Using TIMSS to Analyze Correlates of Performance Variation in Mathematics 

2001- 07 A Comparison of the National Assessment of Educational Progress (NAEP), the Third 

International Mathematics and Science Study Repeat (TIMSS-R), and the Programme 
for International Student Assessment (PISA) 

2002- 01 Legal and Ethical Issues in the Use of Video in Education Research 


NCES contact 
Dan Kasprzyk 
Kerry Gruber 

Stephen Broughman 
Marilyn Seastrom 


Elvira Hausken 

Patrick Gonzales 
Arnold Goldstein 


Patrick Gonzales 



Listing of NCES Working Papers by Subject 


No. Title NCES contact 

Achievement (student) - mathematics 

2001-05 Using TIMSS to Analyze Correlates of Performance Variation in Mathematics Patrick Gonzales 

Adult education 

96-14 The 1995 National Household Education Survey: Reinterview Results for the Adult 
Education Component 

96-20 1991 National Household Education Survey (NHES:91) Questionnaires: Screener. Early 

Childhood Education, and Adult Education 

96- 22 1995 National Household Education Survey (NHES:95) Questionnaires: Screener. Early 

Childhood Program Participation, and Adult Education 
98-03 Adult Education in the 1990s: A Report on the 1991 National Household Education 
Survey 

98-10 Adult Education Participation Decisions and Barriers: Review of Conceptual Frameworks 
and Empirical Studies 

1999- 1 1 Data Sources on Lifelong Learning Available from the National Center for Education 

Statistics 

2000- 16a Lifelong Learning NCES Task Force: Final Report Volume I 

2000- 16b Lifelong Learning NCES Task Force: Final Report Volume II 

Adult literacy — see Literacy of adults 

American Indian - education 

1999-13 1993-94 Schools and Staffing Survey: Data File User’s Manual. Volume IV: Bureau of Kerry Gruber 

Indian Affairs (BIA) Restricted-Use Codebook 

Assessment/achievement 

95-12 Rural Education Data User’s Guide Samuel Peng 

95-13 Assessing Students with Disabilities and Limited English Proficiency James Houser 

97- 29 Can State Assessment Data be Used to Reduce State NAEP Sample Sizes? Larry Ogle 

97-30 ACT's NAEP Redesign Project: Assessment Design is the Key to Useful and Stable Larry Ogle 

Assessment Results 

97-31 NAEP Reconfigured: An Integrated Redesign of the National Assessment of Educational Larry Ogle 

Progress 

97-32 Innovative Solutions to Intractable Large Scale Assessment (Problem 2: Background Larry Ogle 

Questions) 

97- 37 Optimal Rating Procedures and Methodology for NAEP Open-ended Items Larry Ogle 

97^14 Development of a SASS 1993-94 School-Level Student Achievement Subfile: Using Michael Ross 

State Assessments and State NAEP, Feasibility Study 

98- 09 High School Curriculum Structure: Effects on Coursetaking and Achievement in Jeffrey Owings 

Mathematics for High School Graduates — An Examination of Data from the National 
Education Longitudinal Study of 1988 

2001- 07 A Comparison of the National Assessment of Educational Progress (NAEP), the Third Arnold Goldstein 

International Mathematics and Science Study Repeat (TIMSS-R), and the Programme 
for International Student Assessment (PISA) 

2001-1 1 Impact of Selected Background Variables on Students’ NAEP Math Performance Arnold Goldstein 

2001-13 The Effects of Accommodations on the Assessment of LEP Students in NAEP Arnold Goldstein 

2001- 19 The Measurement of Home Background Indicators: Cognitive Laboratory Investigations Arnold Goldstein 

of the Responses of Fourth and Eighth Graders to Questionnaire Items and Parental 
Assessment of the Invasiveness of These Items 

2002- 05 Early Childhood Longitudinal Study- Kindergarten Class of 1998-99 (ECLS-K), 

Psychometric Report for Kindergarten Through First Grade 


Steven Kaufman 

Kathryn Chandler 

Kathryn Chandler 

Peter Stowe 

Peter Stowe 

Lisa Hudson 

Lisa Hudson 
Lisa Hudson 


Elvira Hausken 



No. Title 

2002-06 The Measurement of Instructional Background Indicators: Cognitive Laboratory 

Investigations of the Responses of Fourth and Eighth Grade Students and Teachers to 
Questionnaire Items 

2002-07 Teacher Quality, School Context, and Student Race/Ethnicity: Findings from the Eighth 
Grade National Assessment of Educational Progress 2000 Mathematics Assessment 

Beginning students in postsecondary education 

98-1 1 Beginning Postsecondary Students Longitudinal Study First Follow-up (BPS:96-98) Field 
Test Report 

2001-04 Beginning Postsecondary Students Longitudinal Study: 1996-2001 (BPS: 1996/2001) 

Field Test Methodology Report 

Civic participation 

97-25 1996 National Household Education Survey (NHES:96) Questionnaires: 

Screener/Household and Library, Parent and Family Involvement in Education and 
Civic Involvement, Youth Civic Involvement, and Adult Civic Involvement 

Climate of schools 

95-14 Empirical Evaluation of Social, Psychological, & Educational Construct Variables Used 
in NCES Surveys 

Cost of education indices 

94-05 Cost-of-Education Differentials Across the States 


Course-taking 


95- 

98- 


-12 

-09 


1999-05 

1999-06 

2003-01 

2003-02 

Crime 

97-09 


Rural Education Data User’s Guide 

High School Curriculum Structure: Effects on Coursetaking and Achievement in 

Mathematics for High School Graduates — An Examination of Data from the National 
Education Longitudinal Study of 1988 
Procedures Guide for Transcript Studies 
1998 Revision of the Secondary School Taxonomy 

Mathematics, Foreign Language, and Science Coursetaking and the NELS:88 Transcript 
Data 

English Coursetaking and the NELS:88 Transcript Data 


Status of Data on Crime and Violence in Schools: Final Report 


Curriculum 

95-1 1 Measuring Instruction, Curriculum Content, and Instructional Resources: The Status of 
Recent Work 

98-09 High School Curriculum Structure: Effects on Coursetaking and Achievement in 

Mathematics for High School Graduates — An Examination of Data from the National 
Education Longitudinal Study of 1988 


Customer service 

1999- 10 What Users Say About Schools and Staffing Survey Publications 

2000- 02 Coordinating NCES Surveys: Options, Issues, Challenges, and Next Steps 

2000-04 Selected Papers on Education Surveys: Papers Presented at the 1998 and 1999 ASA and 

1999 AAPOR Meetings 


Data quality 


97-13 

2001-11 

2001-13 

2001-19 


2002-06 


Improving Data Quality in NCES: Database-to-Report Process 
Impact of Selected Background Variables on Students’ NAEP Math Performance 
The Effects of Accommodations on the Assessment of LEP Students in NAEP 
The Measurement of Home Background Indicators: Cognitive Laboratory Investigations 
of the Responses of Fourth and Eighth Graders to Questionnaire Items and Parental 
Assessment of the Invasiveness of These Items 
The Measurement of Instructional Background Indicators: Cognitive Laboratory 

Investigations of the Responses of Fourth and Eighth Grade Students and Teachers to 
Questionnaire Items 


NCES contact 
Arnold Goldstein 


Janis Brown 


Aurora D’Amico 
Paula Knepper 


Kathryn Chandler 


Samuel Peng 


William J. Fowler, Jr. 


Samuel Peng 
Jeffrey Owings 

Dawn Nelson 
Dawn Nelson 
Jeffrey Owings 

Jeffrey Owings 


Lee Hoffman 


Sharon Bobbitt & 
John Ralph 
Jeffrey Owings 


Dan Kasprzyk 
Valena Plisko 
Dan Kasprzyk 


Susan Ahmed 
Arnold Goldstein 
Arnold Goldstein 
Arnold Goldstein 


Arnold Goldstein 


Data warehouse 



No. Title NCES contact 

2000-04 Selected Papers on Education Surveys: Papers Presented at the 1998 and 1999 ASA and Dan Kasprzyk 

1999 AAPOR Meetings 

Design effects 

2000- 03 Strengths and Limitations of Using SUDAAN, Stata, and WesVarPC for Computing Ralph Lee 

Variances from NCES Data Sets 

Dropout rates, high school 

95- 07 National Education Longitudinal Study of 1988: Conducting Trend Analyses HS&B and Jeffrey Owings 

NELS:88 Sophomore Cohort Dropouts 

Early childhood education 

96- 20 1991 National Household Education Survey (NHES:91) Questionnaires: Screener. Early 

Childhood Education, and Adult Education 

96- 22 1995 National Household Education Survey (NHES:95) Questionnaires: Screener. Early 

Childhood Program Participation, and Adult Education 

97- 24 Formulating a Design for the ECLS: A Review of Longitudinal Studies 

97- 36 Measuring the Quality of Program Environments in Head Start and Other Early Childhood 

Programs: A Review and Recommendations for Future Research 

1999- 01 A Birth Cohort Study: Conceptual and Design Considerations and Rationale 

2001- 02 Measuring Father Involvement in Young Children's Lives: Recommendations for a 

Fatherhood Module for the ECLS-B 

2001-03 Measures of Socio-Emotional Development in Middle School 

2001- 06 Papers from the Early Childhood Longitudinal Studies Program: Presented at the 2001 

AERA and SRCD Meetings 

2002- 05 Early Childhood Longitudinal Study- Kindergarten Class of 1998-99 (ECLS-K), 

Psychometric Report for Kindergarten Through First Grade 

Educational attainment 

98- 1 1 Beginning Postsecondary Students Longitudinal Study First Follow-up (BPS:96-98) Field 

Test Report 

2001- 15 Baccalaureate and Beyond Longitudinal Study: 2000/01 Follow-Up Field Test 

Methodology Report 

Educational research 

2000- 02 Coordinating NCES Surveys: Options, Issues, Challenges, and Next Steps Valena Plisko 

2002- 01 Legal and Ethical Issues in the Use of Video in Education Research Patrick Gonzales 

Eighth-graders 

2001- 05 Using TIMSS to Analyze Correlates of Performance Variation in Mathematics 

2002- 07 Teacher Quality, School Context, and Student Race/Ethnicity: Findings from the Eighth 

Grade National Assessment of Educational Progress 2000 Mathematics Assessment 

Employment 

96-03 National Education Longitudinal Study of 1988 (NELS:88) Research Framework and 
Issues 

98-1 1 Beginning Postsecondary Students Longitudinal Study First Follow-up (BPS:96-98) Field 
Test Report 

2000-16a Lifelong Learning NCES Task Force: Final Report Volume I 

2000- 16b Lifelong Learning NCES Task Force: Final Report Volume II 

2001- 01 Cross-National Variation in Educational Preparation for Adulthood: From Early 

Adolescence to Young Adulthood 

Employment - after college 

2001-15 Baccalaureate and Beyond Longitudinal Study: 2000/01 Follow-Up Field Test Andrew G. Malizio 

Methodology Report 

Engineering 

2000-1 1 Financial Aid Profile of Graduate Students in Science and Engineering Aurora D’Amico 


Jeffrey Owings 

Aurora D’Amico 

Lisa Hudson 
Lisa Hudson 
Elvira Hausken 


Patrick Gonzales 
Janis Brown 


Aurora D'Amico 
Andrew G. Malizio 


Kathryn Chandler 

Kathryn Chandler 

Jerry West 
Jerry West 

Jerry West 
Jerry West 

Elvira Hausken 
Jerry West 

Elvira Hausken 


Enrollment - after college 



No. 

Title 

NCES contact 

2001-15 

Baccalaureate and Beyond Longitudinal Study: 2000/01 Follow-Up Field Test 
Methodology Report 

Andrew G. Malizio 

Faculty - 

higher education 


97-26 

Strategies for Improving Accuracy of Postsecondary Faculty Lists 

Linda Zimbler 

2000-01 

1999 National Study of Postsecondary Faculty (NSOPF:99 ) Field Test Report 

Linda Zimbler 

2002-08 

A Profile of Part-time Faculty: Fall 1998 

Linda Zimbler 

Fathers - 

role in education 


2001-02 

Measuring Father Involvement in Young Children's Lives: Recommendations for a 
Fatherhood Module for the ECLS-B 

Jerry West 

Finance - 

elementary and secondary schools 


94-05 

Cost-of-Education Differentials Across the States 

William J. Fowler, Jr. 

96-19 

Assessment and Analysis of School-Level Expenditures 

William J. Fowler, Jr. 

98-01 

Collection of Public School Expenditure Data: Development of a Questionnaire 

Stephen Broughman 

1999-07 

Collection of Resource and Expenditure Data on the Schools and Staffing Survey 

Stephen Broughman 

1999-16 

Measuring Resources in Education: From Accounting to the Resource Cost Model 
Approach 

William J. Fowler, Jr. 

2000-18 

Feasibility Report: School-Level Finance Pretest, Public School District Questionnaire 

Stephen Broughman 

Finance - 

postsecondary 


97-27 

Pilot Test of IPEDS Finance Survey 

Peter Stowe 

2000-14 

IPEDS Finance Data Comparisons Under the 1997 Financial Accounting Standards for 
Private, Not-for-Profit Institutes: A Concept Paper 

Peter Stowe 

Finance - 

private schools 


95-17 

Estimates of Expenditures for Private K-12 Schools 

Stephen Broughman 

96-16 

Strategies for Collecting Finance Data from Private Schools 

Stephen Broughman 

97-07 

The Determinants of Per-Pupil Expenditures in Private Elementary and Secondary 
Schools: An Exploratory Analysis 

Stephen Broughman 

97-22 

Collection of Private School Finance Data: Development of a Questionnaire 

Stephen Broughman 

1999-07 

Collection of Resource and Expenditure Data on the Schools and Staffing Survey 

Stephen Broughman 

2000-15 

Feasibility Report: School-Level Finance Pretest, Private School Questionnaire 

Stephen Broughman 

Geography 


98-04 

Geographic Variations in Public Schools’ Costs 

William J. Fowler, Jr. 

Graduate students 


2000-1 1 

Financial Aid Profile of Graduate Students in Science and Engineering 

Aurora D’Amico 

Graduates of postsecondary education 


2001-15 

Baccalaureate and Beyond Longitudinal Study: 2000/01 Follow-Up Field Test 
Methodology Report 

Andrew G. Malizio 

Imputation 


2000-04 

Selected Papers on Education Surveys: Papers Presented at the 1998 and 1999 ASA and 
1999 AAPOR Meeting 

Dan Kasprzyk 

2001-10 

Comparison of Proc Impute and Schafer's Multiple Imputation Software 

Sam Peng 

2001-16 

Imputation of Test Scores in the National Education Longitudinal Study of 1988 

Ralph Lee 

2001-17 

A Study of Imputation Algorithms 

Ralph Lee 

2001-18 

A Study of Variance Estimation Methods 

Ralph Lee 

Inflation 

97-43 

Measuring Inflation in Public School Costs 

William J. Fowler, Jr. 


Institution data 

2000-01 1999 National Study of Postsecondary Faculty (NSOPF:99) Field Test Report Linda Zimbler 


Instructional resources and practices 

95-1 1 Measuring Instruction, Curriculum Content, and Instructional Resources: The Status of 
Recent Work 


Sharon Bobbitt & 
John Ralph 



NCES contact 
Dan Kasprzyk 


No. Title 

1999-08 Measuring Classroom Instructional Processes: Using Survey and Case Study Field Test 
Results to Improve Item Construction 

International comparisons 

97-1 1 International Comparisons of Inservice Professional Development Dan Kasprzyk 

97-16 International Education Expenditure Comparability Study: Final Report, Volume I Shelley Bums 

97-17 International Education Expenditure Comparability Study: Final Report, Volume II, Shelley Bums 

Quantitative Analysis of Expenditure Comparability 

2001-01 Cross-National Variation in Educational Preparation for Adulthood: From Early Elvira Hausken 

Adolescence to Young Adulthood 

2001-07 A Comparison of the National Assessment of Educational Progress (NAEP), the Third Arnold Goldstein 

International Mathematics and Science Study Repeat (TIMSS-R), and the Programme 
for International Student Assessment (PISA) 

International comparisons - math and science achievement 

2001-05 Using TIMSS to Analyze Correlates of Performance Variation in Mathematics Patrick Gonzales 

Libraries 

94- 07 Data Comparability and Public Policy: New Interest in Public Library Data Papers 

Presented at Meetings of the American Statistical Association 

97- 25 1996 National Household Education Survey (NHES:96) Questionnaires: 

Screener/Household and Library, Parent and Family Involvement in Education and 
Civic Involvement, Youth Civic Involvement, and Adult Civic Involvement 

Limited English Proficiency 

95- 13 Assessing Students with Disabilities and Limited English Proficiency 

2001-1 1 Impact of Selected Background Variables on Students’ NAEP Math Performance 
2001-13 The Effects of Accommodations on the Assessment of LEP Students in NAEP 

Literacy of adults 

98- 17 Developing the National Assessment of Adult Literacy: Recommendations from 

Stakeholders 

1999-09a 1992 National Adult Literacy Survey: An Overview 

1999-09b 1992 National Adult Literacy Survey: Sample Design 

1999-09c 1992 National Adult Literacy Survey: Weighting and Population Estimates 

1999-09d 1992 National Adult Literacy Survey: Development of the Survey Instruments 

1999-09e 1992 National Adult Literacy Survey: Scaling and Proficiency Estimates 

1999-09f 1992 National Adult Literacy Survey: Interpreting the Adult Literacy Scales and Literacy 

Levels 

1999-09g 1992 National Adult Literacy Survey: Literacy Levels and the Response Probability 

Convention 

1999- 1 1 Data Sources on Lifelong Learning Available from the National Center for Education 

Statistics 

2000- 05 Secondary Statistical Modeling With the National Assessment of Adult Literacy: 

Implications for the Design of the Background Questionnaire 
2000-06 Using Telephone and Mail Surveys as a Supplement or Alternative to Door-to-Door 
Surveys in the Assessment of Adult Literacy 

2000-07 “How Much Literacy is Enough?” Issues in Defining and Reporting Performance 
Standards for the National Assessment of Adult Literacy 
2000-08 Evaluation of the 1992 NALS Background Survey Questionnaire: An Analysis of Uses 
with Recommendations for Revisions 

2000- 09 Demographic Changes and Literacy Development in a Decade 

2001- 08 Assessing the Lexile Framework: Results of a Panel Meeting 

Literacy of adults - international 

97- 33 Adult Literacy: An International Perspective Marilyn Binkley 

Mathematics 

98- 09 High School Curriculum Structure: Effects on Coursetaking and Achievement in Jeffrey Owings 

Mathematics for High School Graduates — An Examination of Data from the National 
Education Longitudinal Study of 1988 

1999-08 Measuring Classroom Instructional Processes: Using Survey and Case Study Field Test Dan Kasprzyk 

Results to Improve Item Construction 


Sheida White 

Alex Sedlacek 
Alex Sedlacek 
Alex Sedlacek 
Alex Sedlacek 
Alex Sedlacek 
Alex Sedlacek 

Alex Sedlacek 

Lisa Hudson 

Sheida White 

Sheida White 

Sheida White 

Sheida White 

Sheida White 
Sheida White 


James Houser 
Arnold Goldstein 
Arnold Goldstein 


Carrol Kindel 
Kathryn Chandler 



No. Title 

2001-05 Using TIMSS to Analyze Correlates of Performance Variation in Mathematics 
2001-07 A Comparison of the National Assessment of Educational Progress (NAEP), the Third 

International Mathematics and Science Study Repeat (TIMSS-R), and the Programme 
for International Student Assessment (PISA) 

2001- 1 1 Impact of Selected Background Variables on Students’ NAEP Math Performance 

2002- 06 The Measurement of Instructional Background Indicators: Cognitive Laboratory 

Investigations of the Responses of Fourth and Eighth Grade Students and Teachers to 
Questionnaire Items 

2002-07 Teacher Quality, School Context, and Student Race/Ethnicity: Findings from the Eighth 
Grade National Assessment of Educational Progress 2000 Mathematics Assessment 


Parental involvement in education 

96-03 National Education Longitudinal Study of 1988 (NELS:88) Research Framework and 
Issues 

1996 National Household Education Survey (NHES:96) Questionnaires: 

Screener/Household and Library, Parent and Family Involvement in Education and 
Civic Involvement, Youth Civic Involvement, and Adult Civic Involvement 
A Birth Cohort Study: Conceptual and Design Considerations and Rationale 
Papers from the Early Childhood Longitudinal Studies Program: Presented at the 2001 
AERA and SRCD Meetings 

The Measurement of Home Background Indicators: Cognitive Laboratory Investigations 
of the Responses of Fourth and Eighth Graders to Questionnaire Items and Parental 
Assessment of the Invasiveness of These Items 


97-25 


1999-01 

2001-06 

2001-19 


Participation rates 

98-10 Adult Education Participation Decisions and Barriers: Review of Conceptual Frameworks 
and Empirical Studies 

Postsecondary education 

1999- 1 1 Data Sources on Lifelong Learning Available from the National Center for Education 

Statistics 

2000- 16a Lifelong Learning NCES Task Force: Final Report Volume I 

2000-16b Lifelong Learning NCES Task Force: Final Report Volume II 

Postsecondary education - persistence and attainment 

98-1 1 Beginning Postsecondary Students Longitudinal Study First Follow-up (BPS:96-98) Field 
Test Report 

1999- 15 Projected Postsecondary Outcomes of 1992 High School Graduates 

Postsecondary education - staff 

97-26 Strategies for Improving Accuracy of Postsecondary Faculty Lists 

2000- 01 1999 National Study of Postsecondary Faculty (NSOPF:99 ) Field Test Report 

2002-08 A Profile of Part-time Faculty: Fall 1998 

Principals 

2000-10 A Research Agenda for the 1999-2000 Schools and Staffing Survey 

Private schools 

96- 16 Strategies for Collecting Finance Data from Private Schools 

97- 07 The Determinants of Per-Pupil Expenditures in Private Elementary and Secondary 

Schools: An Exploratory Analysis 

97-22 Collection of Private School Finance Data: Development of a Questionnaire 
2000-13 Non-professional Staff in the Schools and Staffing Survey (SASS) and Common Core of 
Data (CCD) 

2000-15 Feasibility Report: School-Level Finance Pretest. Private School Questionnaire 

Projections of education statistics 

1999-15 Projected Postsecondary Outcomes of 1992 High School Graduates 


Public school finance 

1999- 16 Measuring Resources in Education: From Accounting to the Resource Cost Model 

Approach 

2000- 18 Feasibility Report: School-Level Finance Pretest, Public School District Questionnaire 


NCES contact 
Patrick Gonzales 
Arnold Goldstein 


Arnold Goldstein 
Arnold Goldstein 


Janis Brown 


Jeffrey Owings 
Kathryn Chandler 

Jerry West 
Jerry West 

Arnold Goldstein 


Peter Stowe 


Lisa Hudson 

Lisa Hudson 
Lisa Hudson 


Aurora D’Amico 
Aurora D’Amico 


Linda Zimbler 
Linda Zimbler 
Linda Zimbler 


Dan Kasprzyk 


Stephen Broughman 
Stephen Broughman 

Stephen Broughman 
Kerry Gruber 

Stephen Broughman 


Aurora D’Amico 


William J. Fowler, Jr. 
Stephen Broughman 



No. 


Title 


NCES contact 


Public schools 

97- 43 Measuring Inflation in Public School Costs William J. Fowler, Jr. 

98- 01 Collection of Public School Expenditure Data: Development of a Questionnaire Stephen Broughman 

98-04 Geographic Variations in Public Schools’ Costs William J. Fowler, Jr. 

1999- 02 Tracking Secondary Use of the Schools and Staffing Survey Data: Preliminary Results Dan Kasprzyk 

2000- 12 Coverage Evaluation of the 1994-95 Public Elementary/Secondary School Universe Beth Young 

Survey 

2000-13 Non-professional Staff in the Schools and Staffing Survey (SASS) and Common Core of Kerry Gruber 

Data (CCD) 

2002-02 Locale Codes 1987 - 2000 Frank Johnson 

Public schools - secondary 

98-09 High School Curriculum Structure: Effects on Coursetaking and Achievement in Jeffrey Owings 

Mathematics for High School Graduates — An Examination of Data from the National 
Education Longitudinal Study of 1988 

Reform, educational 

96-03 National Education Longitudinal Study of 1988 (NELS:88) Research Framework and Jeffrey Owings 

Issues 

Response rates 

98-02 Response Variance in the 1993-94 Schools and Staffing Survey: A Reinterview Report Steven Kaufman 

School districts 

2000-10 A Research Agenda for the 1999-2000 Schools and Staffing Survey Dan Kasprzyk 

School districts, public 

98-07 Decennial Census School District Project Planning Report 

1999-03 Evaluation of the 1996-97 Nonfiscal Common Core of Data Surveys Data Collection, 

Processing, and Editing Cycle 

School districts, public - demographics of 

96- 04 Census Mapping Project/School District Data Book Tai Phan 

Schools 

97- 42 Improving the Measurement of Staffing Resources at the School Level: The Development 

of Recommendations for NCES for the Schools and Staffing Survey (SASS) 

98- 08 The Redesign of the Schools and Staffing Survey for 1999-2000: A Position Paper 

1999- 03 Evaluation of the 1996-97 Nonfiscal Common Core of Data Surveys Data Collection, 

Processing, and Editing Cycle 

2000- 10 A Research Agenda for the 1999-2000 Schools and Staffing Survey 

2002-02 Locale Codes 1987-2000 

2002-07 Teacher Quality, School Context, and Student Race/Ethnicity: Findings from the Eighth 
Grade National Assessment of Educational Progress 2000 Mathematics Assessment 

Schools - safety and discipline 

97- 09 Status of Data on Crime and Violence in Schools: Final Report Lee Hoffman 

Science 

2000- 1 1 Financial Aid Profile of Graduate Students in Science and Engineering 

2001- 07 A Comparison of the National Assessment of Educational Progress (NAEP), the Third 

International Mathematics and Science Study Repeat (TIMSS-R), and the Programme 
for International Student Assessment (PISA) 

Software evaluation 

2000-03 Strengths and Limitations of Using SUDAAN, Stata, and WesVarPC for Computing Ralph Lee 

Variances from NCES Data Sets 

Staff 

97^12 Improving the Measurement of Staffing Resources at the School Level: The Development Mary Rollefson 
of Recommendations for NCES for the Schools and Staffing Survey (SASS) 

98- 08 The Redesign of the Schools and Staffing Survey for 1999-2000: A Position Paper 


Aurora D’Amico 
Arnold Goldstein 


Mary Rollefson 

Dan Kasprzyk 
Beth Young 

Dan Kasprzyk 
Frank Johnson 
Janis Brown 


Tai Phan 
Beth Young 


Dan Kasprzyk 



No. 


Title 


NCES contact 


Staff - higher education institutions 

97-26 Strategies for Improving Accuracy of Postsecondary Faculty Lists 
2002-08 A Profile of Part-time Faculty: Fall 1998 

Staff - nonprofessional 

2000- 13 Non-professional Staff in the Schools and Staffing Survey (SASS) and Common Core of 

Data (CCD) 

State 

1999-03 Evaluation of the 1996-97 Nonfiscal Common Core of Data Surveys Data Collection, 
Processing, and Editing Cycle 

Statistical methodology 

97-21 Statistics for Policymakers or Everything You Wanted to Know About Statistics But 
Thought You Could Never Understand 

Statistical standards and methodology 

2001- 05 Using TIMSS to Analyze Correlates of Performance Variation in Mathematics 

2002- 04 Improving Consistency of Response Categories Across NCES Surveys 

Students with disabilities 

95- 13 Assessing Students with Disabilities and Limited English Proficiency 

2001-13 The Effects of Accommodations on the Assessment of LEP Students in NAEP 

Survey methodology 

96- 17 National Postsecondary Student Aid Study: 1996 Field Test Methodology Report 

97- 15 Customer Service Survey: Common Core of Data Coordinators 

97- 35 Design, Data Collection, Interview Administration Time, and Data Editing in the 1996 

National Household Education Survey 

98- 06 National Education Longitudinal Study of 1988 (NELS:88) Base Year through Second 

Follow-Up: Final Methodology Report 

98-1 1 Beginning Postsecondary Students Longitudinal Study First Follow-up (BPS:96-98) Field 
Test Report 

98-16 A Feasibility Study of Longitudinal Design for Schools and Staffing Survey 
1999-07 Collection of Resource and Expenditure Data on the Schools and Staffing Survey 

1999- 17 Secondary Use of the Schools and Staffing Survey Data 

2000- 01 1999 National Study of Postsecondary Faculty (NSOPF:99 ) Field Test Report 

2000-02 Coordinating NCES Surveys: Options, Issues, Challenges, and Next Steps 

2000-04 Selected Papers on Education Surveys: Papers Presented at the 1998 and 1999 ASA and 

1999 AAPOR Meetings 

2000-12 Coverage Evaluation of the 1994—95 Public Elementary/Secondary School Universe 
Survey 

2000- 17 National Postsecondary Student Aid Study:2000 Field Test Methodology Report 

2001- 04 Beginning Postsecondary Students Longitudinal Study: 1996-2001 (BPS: 1996/2001) 

Field Test Methodology Report 

2001-07 A Comparison of the National Assessment of Educational Progress (NAEP), the Third 

International Mathematics and Science Study Repeat (TIMSS-R), and the Programme 
for International Student Assessment (PISA) 

2001-1 1 Impact of Selected Background Variables on Students’ NAEP Math Performance 
2001-13 The Effects of Accommodations on the Assessment of LEP Students in NAEP 

2001- 19 The Measurement of Home Background Indicators: Cognitive Laboratory Investigations 

of the Responses of Fourth and Eighth Graders to Questionnaire Items and Parental 
Assessment of the Invasiveness of These Items 

2002- 01 Legal and Ethical Issues in the Use of Video in Education Research 

2002-02 Locale Codes 1987 - 2000 

2002-03 National Postsecondary Student Aid Study, 1999-2000 (NPSAS:2000), CATI 
Nonresponse Bias Analysis Report. 


Linda Zimbler 
Linda Zimbler 


Kerry Gruber 


Beth Young 


Susan Ahmed 


Patrick Gonzales 
Marilyn Seastrom 


James Houser 
Arnold Goldstein 


Andrew G. Malizio 
Lee Hoffman 
Kathryn Chandler 

Ralph Lee 

Aurora D’Amico 

Stephen Broughman 
Stephen Broughman 
Susan Wiley 
Linda Zimbler 
Valena Plisko 
Dan Kasprzyk 

Beth Young 

Andrew G. Malizio 
Paula Knepper 

Arnold Goldstein 


Arnold Goldstein 
Arnold Goldstein 
Arnold Goldstein 


Patrick Gonzales 
Frank Johnson 
Andrew Malizio 



No. Title 

2002- 06 The Measurement of Instructional Background Indicators: Cognitive Laboratory 

Investigations of the Responses of Fourth and Eighth Grade Students and Teachers to 
Questionnaire Items 

2003- 03 Education Longitudinal Study: 2002 (ELS: 2002) Field Test Report 

Teachers 

98-13 Response Variance in the 1994-95 Teacher Follow-up Survey 

1999- 14 1994-95 Teacher Followup Survey: Data File User’s Manual, Restricted-Use Codebook 

2000- 10 A Research Agenda for the 1999-2000 Schools and Staffing Survey 

2002-07 Teacher Quality, School Context, and Student Race/Ethnicity: Findings from the Eighth 
Grade National Assessment of Educational Progress 2000 Mathematics Assessment 

Teachers - instructional practices of 

98-08 The Redesign of the Schools and Staffing Survey for 1999-2000: A Position Paper 
2002-06 The Measurement of Instructional Background Indicators: Cognitive Laboratory 

Investigations of the Responses of Fourth and Eighth Grade Students and Teachers to 
Questionnaire Items 

Teachers - opinions regarding safety 

98-08 The Redesign of the Schools and Staffing Survey for 1999-2000: A Position Paper 

Teachers - performance evaluations 

1999-04 Measuring Teacher Qualifications 

Teachers - qualifications of 

1999- 04 Measuring Teacher Qualifications 

Teachers - salaries of 

94- 05 Cost-of-Education Differentials Across the States 

Training 

2000- 16a Lifelong Learning NCES Task Force: Final Report Volume I 

2000-16b Lifelong Learning NCES Task Force: Final Report Volume II 

Variance estimation 

2000-03 Strengths and Limitations of Using SUDAAN, Stata, and WesVarPC for Computing 
Variances from NCES Data Sets 

2000- 04 Selected Papers on Education Surveys: Papers Presented at the 1998 and 1999 ASA and 

1999 AAPOR Meetings 

2001- 18 A Study of Variance Estimation Methods 

Violence 

97-09 Status of Data on Crime and Violence in Schools: Final Report 

Vocational education 

95- 12 Rural Education Data User’s Guide 

1999-05 Procedures Guide for Transcript Studies 
1999-06 1998 Revision of the Secondary School Taxonomy 


NCES contact 
Arnold Goldstein 


Jeffrey Owings 

Steven Kaufman 
Kerry Gruber 
Dan Kasprzyk 
Janis Brown 


Dan Kasprzyk 
Arnold Goldstein 


Dan Kasprzyk 
Dan Kasprzyk 
Dan Kasprzyk 
William J. Fowler, Jr. 


Lisa Hudson 
Lisa Hudson 


Ralph Lee 
Dan Kasprzyk 
Ralph Lee 

Lee Hoffman 


Samuel Peng 
Dawn Nelson 
Dawn Nelson 



